Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

90 ◾ Bioinformatics

The coverage describes how often, on average, a reference sequence is covered by bases

from the reads. The sequencing depth is defined as the total number of reads that align to

or cover a specific locus (Figure 3.1).

The de novo assembly strategy has been greatly enhanced by the single-molecule real-

time (SMRT) sequencing (Pacific Biosciences (PacBio)) and nanopore sequencing (Oxford

Nanopore Technologies (ONT)), which produce long reads. In general, the de novo genome

assembly depends on the read length, sequencing depth, and the assembling algorithm.

Based on the sequencing technology used, reads can be classified into short reads, which

are produced by the most sequencing technologies including Illumina, and long reads pro-

duced by PacBio and ONT. In general, either sequence alignment or graphs approach is

used for the de novo assembly. The algorithms used by the de novo genome assemblers can

be classified into (i) greedy approach, (ii) overlap-layout-consensus with Hamiltonian path,

or (iii) de Bruijn graph with Eulerian path. The last two approaches share the idea that a

genome assembly can be represented as a path in graphs.

3.1.1 Greedy Algorithm

The greedy algorithm was the first one used for de novo genome assembly. Like multi-

ple sequence alignment, it depends on the similarity between reads to create a pileup of

aligned sequences, which are collapsed to create contigs from the consensus sequences. The

greedy algorithm uses pairwise alignment to compare all reads to identify the most simi-

lar reads with sufficient overlaps and then it joins them together. The reads with the sig-

nificant overlaps are merged. The algorithm continues by conducting pairwise alignments

on the merged reads to identify the sequences with significant overlaps and merges them.

This process will continue repeatedly until there is no more merging. When there is low

sequencing depth, gaps may be present between the assembled contigs; deep sequencing

and the use of paired-end reads can reduce gaps and allow longer contigs. Greedy assem-

blers include TIGR assembler [2] and PHRAP [3]. For instance, assume that we have the

reads CATTC, ATTCG, GTCAT, TTCGC, and GGCAT. Figure 3.2 shows how the greedy

algorithm works to assemble the consensus sequence GTCATTCGCAT from those reads.

3.1.2 Overlap-Consensus Graphs

The overlap graph method was first used for assembling genomes from sequences produced

by Sanger sequencing method and then it was extended by some assemblers for NGS reads.

FIGURE 3.1 Sequencing coverage and depth.